Training Multi-Modal AI: Inside the Jina CLIP Embedding Model | S2 E11

Update: 2024-10-25

Description

Today we are talking to Michael Günther, a senior machine learning scientist at Jina AI, about his work on Jina CLIP.

Some key points:

  • Uni-modal embeddings convert a single type of input (text, images, audio) into vectors
  • Multimodal embeddings learn a joint embedding space that can handle multiple types of input, enabling cross-modal search (e.g., searching images with text; see the code sketch after this list)
  • Multimodal models can potentially learn richer representations of the world, including concepts that are difficult or impossible to put into words
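
To make cross-modal search concrete, here is a minimal sketch using Hugging Face's CLIP classes; the checkpoint, image filenames, and query are illustrative stand-ins, not details from the episode:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat.jpg", "dog.jpg"]]  # placeholder files
query = "a photo of a cat"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Both towers map into the same space; cosine similarity between the
# (normalized) embeddings ranks the images for the text query.
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print("best match: image", scores.argmax().item())
```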

Types of Text-Image Models

  1. CLIP-like Models
    • Separate vision and text transformer models
    • Each tower maps inputs to a shared vector space
    • Optimized for efficient retrieval, since embeddings can be precomputed and indexed per modality (see the two-tower sketch after this list)
  2. Vision-Language Models
    • Process image patches as tokens
    • Use transformer architecture to combine image and text information
    • Better suited for complex document matching
  3. Hybrid Models
    • Combine separate encoders with additional transformer components
    • Allow for more complex interactions between modalities
    • Example: Google's Magic Lens model
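
A schematic of the CLIP-like two-tower design, with simple linear layers standing in for the real text and vision transformers (dimensions and layer choices are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, text_dim=768, image_dim=1024, shared_dim=512):
        super().__init__()
        # Stand-ins for a text transformer and a vision transformer.
        self.text_encoder = nn.Linear(text_dim, text_dim)
        self.image_encoder = nn.Linear(image_dim, image_dim)
        # Projection heads map each tower's output into the shared space.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)

    def encode_text(self, x):
        return F.normalize(self.text_proj(self.text_encoder(x)), dim=-1)

    def encode_image(self, x):
        return F.normalize(self.image_proj(self.image_encoder(x)), dim=-1)

model = TwoTower()
text_emb = model.encode_text(torch.randn(4, 768))     # 4 captions
image_emb = model.encode_image(torch.randn(4, 1024))  # 4 images
similarity = text_emb @ image_emb.T                   # (4, 4) cosine similarities
```

Because the towers are independent, a corpus can be embedded once and indexed; only the query side runs at search time, which is what makes this design efficient for retrieval.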

Training Insights from Jina CLIP

  1. Key Learnings
    • Freezing the text encoder during training can significantly hinder performance
    • Short image captions limit the model's ability to learn rich text representations
    • Large batch sizes are crucial for training embedding models effectively, because contrastive training treats the rest of the batch as negatives (see the loss sketch after this list)
  2. Training Process
    • Three-stage training approach: 
      • Stage 1: Training on short image-caption pairs and text pairs
      • Stage 2: Adding longer image captions
      • Stage 3: Including triplet data with hard negatives
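
Why batch size matters follows from the symmetric contrastive (InfoNCE) loss that CLIP-style models train with: every other item in the batch serves as a negative, so larger batches yield harder, more informative training signal. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def clip_loss(text_emb, image_emb, temperature=0.07):
    # Embeddings are assumed L2-normalized; logits are scaled cosine similarities.
    logits = text_emb @ image_emb.T / temperature
    targets = torch.arange(len(logits))  # i-th text matches i-th image
    loss_t2i = F.cross_entropy(logits, targets)    # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)  # image -> text direction
    return (loss_t2i + loss_i2t) / 2

batch = 8  # real CLIP-style training uses batches in the tens of thousands
t = F.normalize(torch.randn(batch, 512), dim=-1)
i = F.normalize(torch.randn(batch, 512), dim=-1)
print(clip_loss(t, i))
```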

Practical Considerations

  • Similarity Scales
    • Different modality pairs (e.g., text-text vs. text-image) can produce similarity values on different scales
    • Important to consider when combining multiple embedding types
    • Can affect threshold-based filtering (a normalization sketch follows this list)
  • Model Selection
    • Evaluate models based on relevant benchmarks
    • Consider the domain similarity between training data and intended use case
    • Assess computational requirements and efficiency needs
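
One simple way to handle mismatched similarity scales is to standardize each modality's score distribution on a calibration sample before combining or thresholding. The numbers and equal weighting below are illustrative assumptions, not a recommendation from the episode:

```python
import numpy as np

text_scores = np.array([0.62, 0.71, 0.55, 0.80])   # text-to-text similarities
image_scores = np.array([0.21, 0.28, 0.19, 0.33])  # text-to-image similarities

def zscore(scores, ref_mean, ref_std):
    return (scores - ref_mean) / ref_std

# In practice, reference statistics come from a held-out calibration set.
t_norm = zscore(text_scores, text_scores.mean(), text_scores.std())
i_norm = zscore(image_scores, image_scores.mean(), image_scores.std())

combined = 0.5 * t_norm + 0.5 * i_norm  # weights are a tunable assumption
keep = combined > 0.0                   # threshold now applies on a common scale
```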

Future Directions

  1. Areas for Development
    • More comprehensive benchmarks for multimodal tasks
    • Better support for semi-structured data
    • Improved handling of non-photographic images
  2. Upcoming Developments at Jina AI
    • Multilingual support for Jina ColBERT
    • New version of text embedding models
    • Focus on complex multimodal search applications

Practical Applications

  • E-commerce
    • Product search and recommendations
    • Combined text-image embeddings for better results
    • Synthetic data generation for fine-tuning
  • Fine-tuning Strategies
    • Using click data and query logs
    • Generative pseudo-labeling for creating training data (sketched after this list)
    • Domain-specific adaptations
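
A sketch of the generative pseudo-labeling idea for domain adaptation: generate synthetic queries for unlabeled in-domain documents, mine hard negatives with an existing retriever, and fine-tune on the resulting triplets. Here, generate_query and retrieve_top_k are hypothetical callables (e.g., an LLM prompt and an existing retriever), not a real API:

```python
def build_training_triplets(documents, generate_query, retrieve_top_k,
                            num_negatives=3):
    """Build (query, positive, hard negative) triplets from unlabeled docs."""
    triplets = []
    for doc in documents:
        # e.g., prompt an LLM: "Write a search query this document answers: ..."
        query = generate_query(doc)
        # Top-ranked documents that are not the source serve as hard negatives.
        candidates = retrieve_top_k(query, k=10)
        negatives = [c for c in candidates if c != doc][:num_negatives]
        triplets.extend((query, doc, neg) for neg in negatives)
    return triplets

# The triplets then feed stage-3-style fine-tuning with a triplet or
# contrastive loss.
```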

Key Takeaways for Engineers

  1. Be aware of similarity value scales and their implications
  2. Establish quantitative evaluation metrics before optimization
  3. Consider model limitations (e.g., image resolution, text length)
  4. Use performance optimizations like flash attention and activation checkpointing (see the sketch below)
  5. Universal embedding models might not be optimal for specific use cases
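
Both optimizations in point 4 are available through standard PyTorch APIs; a minimal sketch (tensor shapes and module sizes are arbitrary):

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# Flash attention: scaled_dot_product_attention dispatches to a fused
# FlashAttention kernel automatically on supported GPUs and dtypes.
q = torch.randn(2, 8, 128, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)

# Activation checkpointing: recompute this block in the backward pass instead
# of storing its activations, trading compute for memory; one way to fit the
# large batches that contrastive training needs.
block = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
)
x = torch.randn(32, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```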

Guest: Michael Günther
Host: Nicolay Gerold

Chapters:

00:00 Introduction to Uni-modal and Multimodal Embeddings
00:16 Exploring Multimodal Embeddings and Their Applications
01:06 Training Multimodal Embedding Models
02:21 Challenges and Solutions in Embedding Models
07:29 Advanced Techniques and Future Directions
29:19 Understanding Model Interference in Search Specialization
30:17 Fine-Tuning Jina CLIP for E-Commerce
32:18 Synthetic Data Generation and Pseudo-Labeling
33:36 Challenges and Learnings in Embedding Models
40:52 Future Directions and Takeaways

